Import scEGOT class and other libraries. We use os, glob, itertools, and pandas to read and reshape input data.
[2]:
import os
import glob
import itertools
import pandas as pd
import anndata
from scegot import scEGOT
Set constants
[3]:
DATASET_INPUT_ROOT_PATH = os.path.join(os.getcwd(), "dataset/input/")
RANDOM_STATE = 2023
PCA_N_COMPONENTS = 30
GMM_CLUSTER_NUMBERS = [1, 2, 4, 5, 5]
UMAP_N_NEIGHBORS = 1000
DAY_NAMES = ["day0", "day0.5", "day1", "day1.5", "day2"]
Read input files
Either AnnData or DataFrame (CSV) can be used. - In the case of AnnData: The name of the obs key containing the day information is passed as the adata_day_key argument when instantiating the scEGOT class. - In the case of DataFrame: One DataFrame must correspond to each day; prepare an array of DataFrames in the order of day.
[4]:
# ----- In the case of AnnData -----
input_data = anndata.read_h5ad(
f"{DATASET_INPUT_ROOT_PATH}scRNAseq_hPGCLC_induction_Saitou.h5ad"
)
print(input_data)
display(input_data.X)
display(input_data.obs)
display(input_data.var)
# ----- In the case of DataFrame (CSV) -----
# input_file_paths = sorted(glob.glob(f"{DATASET_INPUT_ROOT_PATH}usedataSmall/*.csv"))
# input_data = [
# pd.read_csv(input_file_path, index_col=0) for input_file_path in input_file_paths
# ]
# print(f"number of days: {len(input_data)}")
# print("shape of each day's data:")
# for i, data in enumerate(input_data):
# print(f" {DAY_NAMES[i]}: {data.shape}")
# print("data example: ")
# input_data[0].head()
AnnData object with n_obs × n_vars = 11771 × 18922
obs: 'sample', 'percent.mito', 'day', 'cluster_day'
var: 'spliced', 'unspliced', 'mt', 'TF', 'gene_name'
layers: 'X_raw', 'spliced', 'unspliced'
<11771x18922 sparse matrix of type '<class 'numpy.float32'>'
with 94477374 stored elements in Compressed Sparse Row format>
| sample | percent.mito | day | cluster_day | |
|---|---|---|---|---|
| iM.data_GTGGAAGGTCAATGGG-1 | iM.data | 0.056487 | iM | day0 |
| iM.data_TTCATGTCAACCCGCA-1 | iM.data | 0.216231 | iM | day0 |
| iM.data_GAGGGTATCCAGGACC-1 | iM.data | 0.076525 | iM | day0 |
| iM.data_AAGTCGTAGGCTTTCA-1 | iM.data | 0.080264 | iM | day0 |
| iM.data_ACCGTTCGTAACTTCG-1 | iM.data | 0.280788 | iM | day0 |
| ... | ... | ... | ... | ... |
| d2b.data_AAGCCATAGGGCGAGA-1 | d2b.data | 4.811476 | d2b | day2 |
| d2b.data_CAACCAATCTTCCGTG-1 | d2b.data | 2.554428 | d2b | day2 |
| d2b.data_AGGCCACGTGAGTAGC-1 | d2b.data | 3.142146 | d2b | day2 |
| d2b.data_GATCAGTTCGAGTACT-1 | d2b.data | 4.287140 | d2b | day2 |
| d2b.data_TCATCCGTCATGGGAG-1 | d2b.data | 2.835834 | d2b | day2 |
11771 rows × 4 columns
| spliced | unspliced | mt | TF | gene_name | |
|---|---|---|---|---|---|
| FAM3A | 1 | 1 | 0 | 0 | FAM3A |
| SLC25A1 | 1 | 1 | 0 | 0 | SLC25A1 |
| RBL1 | 1 | 1 | 0 | 0 | RBL1 |
| PPP2R1A | 1 | 1 | 0 | 0 | PPP2R1A |
| H3F3B | 1 | 1 | 0 | 0 | H3F3B |
| ... | ... | ... | ... | ... | ... |
| OR2W5 | 1 | 0 | 0 | 0 | OR2W5 |
| ODF4 | 1 | 1 | 0 | 0 | ODF4 |
| CRP | 1 | 0 | 0 | 0 | CRP |
| KRTAP4-9 | 1 | 0 | 0 | 0 | KRTAP4-9 |
| THEMIS | 1 | 1 | 0 | 0 | THEMIS |
18922 rows × 5 columns
[6]:
tf_gene_names = (
pd.read_csv(
f"{DATASET_INPUT_ROOT_PATH}TF_genes/TFgenes_name.csv", header=None, index_col=0
)
.T["cell"]
.values
)
tf_gene_names
[6]:
array(['ZBTB8B', 'GSX2', 'TBX2', ..., 'NME2', 'ZNF488', 'ZNF280B'],
dtype=object)
Tutorial
Create scEGOT instance.
You must pass values for the adata_day_key argument when using AnnData and for the day_names argument when using an array of DataFrames. adata_day_key is the key of the day in AnnData.obs, and day_names is an array containing the names of the days in each DataFrames.
[7]:
# ----- In the case of AnnData -----
scegot = scEGOT(
input_data,
verbose=True, # default=True
adata_day_key="cluster_day",
)
# ----- In the case of DataFrame (CSV) -----
# scegot = scEGOT(
# input_data,
# PCA_N_COMPONENTS,
# GMM_CLUSTER_NUMBERS,
# verbose=True, # default=True
# day_names=DAY_NAMES, # uncomment this line if you use an array of DataFrames
# )
Processing AnnData...
Preprocess:
Preprocess in the following order.
RECODE -> UMI normalize -> log1p normalize -> select variable genes -> PCA
You can also apply UMAP by calling apply_umap() after preprocess(). In the tutorial, UMAP is applied after GMM is applied.
[8]:
%%time
X, pca_model = scegot.preprocess(
pca_n_components=PCA_N_COMPONENTS,
recode_params={}, # default={}
umi_target_sum=1e5, # default=1e4
pca_random_state=RANDOM_STATE, # default=None
pca_other_params={}, # default={}
apply_recode=True, # default=True
apply_normalization_log1p=True, # default=True
apply_normalization_umi=True, # default=True
select_genes=True, # default=True
n_select_genes=2000, # default=2000
)
Applying RECODE...
start RECODE for scRNA-seq data
end RECODE for scRNA-seq
log: {'seq_target': 'RNA', '#significant genes': 15820, '#non-significant genes': 2582, '#silent genes': 65, 'ell': 288, 'Elapsed time': '0h 0m 25s 867ms', 'stat_learning': True, '#test_data': 2354}
Applying UMI normalization...
Applying log1p normalization...
Applying PCA...
sum of explained_variance_ratio = 84.9264145902248
CPU times: user 2min 37s, sys: 15.8 s, total: 2min 53s
Wall time: 42.1 s
Apply GMM:
Apply GMM and visualize the results in each case of PCA and UMAP. To switch between PCA and UMAP displays, specify the mode argument to the plotting function.
[9]:
%%time
gmm_models, gmm_labels = scegot.fit_predict_gmm(
n_components_list=GMM_CLUSTER_NUMBERS,
covariance_type="full", # default="full"
max_iter=2000, # default=2000
n_init=10, # default=10
random_state=RANDOM_STATE, # default=None
gmm_other_params={}, # default={}
)
Fitting GMM models with each day's data and predicting labels for them...
100%|█████████████████████████████████████████████████████████████| 5/5 [02:54<00:00, 34.96s/it]
CPU times: user 19min 38s, sys: 51.5 s, total: 20min 29s
Wall time: 2min 54s
Visualization in PCA
[10]:
scegot.plot_gmm_predictions(
mode="pca", # default="pca"
plot_gmm_means=True, # defalut=False
figure_titles_without_gmm=[f"{name} on PCA" for name in DAY_NAMES], # default=None
figure_titles_with_gmm=[f"{name} with GMM" for name in DAY_NAMES], # default=None
cmap="viridis", # default="plasma"
# save=True, # default=False
# save_paths=[f"./gmm_preds_{day_name}.png" for day_name in DAY_NAMES] # default=None
)
Apply UMAP
[11]:
%%time
X_umap, umap_model = scegot.apply_umap(
n_neighbors=UMAP_N_NEIGHBORS,
n_components=2, # default=2
random_state=RANDOM_STATE, # default=None
min_dist=0.8, # default=0.1
umap_other_params={}, # default={}
)
CPU times: user 1min 58s, sys: 2.11 s, total: 2min
Wall time: 48.9 s
Visualization in UMAP
[12]:
scegot.plot_gmm_predictions(
mode="umap", # default="pca"
plot_gmm_means=True, # defalut=False
figure_titles_without_gmm=[f"{name} on UMAP" for name in DAY_NAMES], # default=None
figure_titles_with_gmm=[f"{name} with GMM" for name in DAY_NAMES], # default=None
cmap="plasma", # default="plasma"
# save=True, # default=False
# save_paths=[f"./gmm_preds_{day_name}.png" for day_name in DAY_NAMES] # default=None
)
Cell State:
You can visualize cell state in two patterns: animated and graphical
Animation
[13]:
scegot.animate_cell_state(
cmap="rainbow", # default="rainbow"
interpolate_interval=11, # default=11
# save=True, # default=False
# save_path="./cell_state_video.gif", # default=None
)
Interpolating between day0 and day0.5...
100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00, 3.97it/s]
Interpolating between day0.5 and day1...
100%|███████████████████████████████████████████████████████████| 10/10 [00:07<00:00, 1.39it/s]
Interpolating between day1 and day1.5...
100%|███████████████████████████████████████████████████████████| 10/10 [00:08<00:00, 1.19it/s]
Interpolating between day1.5 and day2...
100%|███████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.11s/it]
Creating animation...
Graph
There are two functions for visualizing cell state graphs, one is plot_cell_state_graph() and the other is plot_simple_cell_state_graph(). Both can visualize PCA and UMAP.
Before plotting the graph, call make_cell_state_graph() to create a graph object. At that time, cluster names must be specified as an array. Of course, you can create them manually, or there is a function called generate_cluster_names_with_day() that automatically generates cluster names for you.
Since cluster names are not maintained in the scEGOT class, an array of cluster names must be passed to the function that draws the cluster names each time.
There will be other occasions where cluster names are passed in the pathway and cell velocity sections, but it is not necessary for the cluster names in those cases to match the cluster names used in this section. Therefore, the cluster name can be changed flexibly.
[14]:
cluster_names = scegot.generate_cluster_names_with_day()
cluster_names
[14]:
[['day0-0'],
['day0.5-0', 'day0.5-1'],
['day1-0', 'day1-1', 'day1-2', 'day1-3'],
['day1.5-0', 'day1.5-1', 'day1.5-2', 'day1.5-3', 'day1.5-4'],
['day2-0', 'day2-1', 'day2-2', 'day2-3', 'day2-4']]
In the case of plot_cell_state_graph()
Gene information is displayed in the graph.
PCA ver.
[15]:
G = scegot.make_cell_state_graph(
cluster_names,
mode="pca", # default="pca"
threshold=0.05, # default=0.05
)
scegot.plot_cell_state_graph(
G,
cluster_names,
tf_gene_names,
tf_gene_pick_num=5, # default=5
# save=True, # default=False
# save_path="./cell_state_graph_pca.png" # default=None
)
UMAP ver.
[16]:
G_umap = scegot.make_cell_state_graph(
cluster_names,
mode="umap", # default="pca"
threshold=0.05, # default=0.05
)
scegot.plot_cell_state_graph(
G_umap,
cluster_names,
tf_gene_names,
tf_gene_pick_num=5, # default=5
# save=True, # default=False
# save_path="./cell_state_graph_umap.png" # default=None
)
In the case of plot_simple_cell_state_graph()
The shape of the graph can be changed by changing the layout argument.
When
layout="hierarchy"andorder="weight", the nodes are sorted in order of node weight.If the order argument is not specified, the nodes are sorted in the order of their gmm labels.
PCA ver.
[18]:
scegot.plot_simple_cell_state_graph(
G,
layout="normal", # default="normal"
# save=True, # default=False
# save_path="./simple_cell_state_graph_pca.png" # default=None
)
UMAP ver.
[19]:
scegot.plot_simple_cell_state_graph(
G_umap,
layout="normal", # default="normal"
# save=True, # default=False
# save_path="./simple_cell_state_graph_umap.png" # default=None
)
layout=”hierarchy”
[20]:
scegot.plot_simple_cell_state_graph(
G,
layout="hierarchy", # default="normal"
order="weight", # default=None
# save=True, # default=False
# save_path="./simple_cell_state_graph_hierarchy.png" # default=None
)
Fold Change:
You can see the fold change between the two clusters.
[21]:
cluster1 = "day0.5-0"
cluster2 = "day0.5-1"
scegot.plot_fold_change(
cluster_names,
tf_gene_names,
cluster1,
cluster2,
threshold=0.8, # default=1.0
# save=True, # default=False
# save_path="./fold_change.png" # default=None
)
Pathway:
Visualize the transition of expression levels of selected_genes in the clusters specified by pathway_names.
[22]:
pathway_names = [
"day0-0",
"day0.5-1",
"day1-1",
"day1.5-4",
"day2-2",
]
scegot.plot_pathway_mean_var(
cluster_names,
tf_gene_names,
pathway_names,
threshold=1.0, # default=1.0
# save=True, # default=False
# save_path="./pathway_mean_var.png" # default=None
)
[23]:
pathway_names = [
"day0-0",
"day0.5-1",
"day1-1",
"day1.5-4",
"day2-2",
]
selected_genes = ["NANOG", "SOX17", "TFAP2C", "NKX1-2"]
scegot.plot_pathway_gene_expressions(
cluster_names,
tf_gene_names,
pathway_names,
selected_genes,
# save=True, # default=False
# save_path="./pathway_gene_expressions.png" # default=None
)
Features of single gene:
You can also plot only the features of a single gene.
2D
PCA ver.
[24]:
gene_name = "NKX1-2"
scegot.plot_pathway_single_gene_2d(
gene_name,
mode="pca", # default="pca"
# save=True, # default=False
# save_path="./pathway_single_gene_2d_pca.png" # default=None
)
UMAP ver.
[25]:
gene_name = "NKX1-2"
scegot.plot_pathway_single_gene_2d(
gene_name,
mode="umap", # default="pca"
# save=True, # default=False
# save_path="./pathway_single_gene_2d_umap.png" # default=None
)
3D
[26]:
gene_name = "NKX1-2"
scegot.plot_pathway_single_gene_3d(
gene_name,
# save=True, # default=False
# save_path="./pathway_single_gene_3d.html" # default=None
)
Interpolation:
Accuracy check of interpolation
For each of PCA and UMAP, interpolated and correct data can be compared. Also, source and target can be toggled on and off.
PCA ver.
[27]:
scegot.plot_true_and_interpolation_distributions(
interpolate_index=2,
mode="pca", # default="pca"
n_samples=1000, # default=2000
t=0.5, # default=0.5
plot_source_and_target=True, # default=True
alpha_true=0.5, # default=0.5
# save=True, # default=False
# save_path="./true_and_interpolation_distributions_pca.png" # default=None
)
UMAP ver.
[28]:
scegot.plot_true_and_interpolation_distributions(
interpolate_index=2,
mode="umap", # default="pca"
n_samples=500, # default=2000
t=0.5, # default=0.5
plot_source_and_target=False, # default=True
alpha_true=0.5, # default=0.5
# save=True, # default=False
# save_path="./true_and_interpolation_distributions_pca.png" # default=None
)
Transforming interpolated data with UMAP...
Video visualization of feature changes in a single gene
PCA ver.
[42]:
target_gene = "NKX1-2"
scegot.animate_interpolation(
target_gene,
mode="pca", # default="pca"
interpolate_interval=11, # default=11
n_samples=5000,
cmap="gnuplot2", # default="rainbow"
# save=True, # default=False
# save_path="./interpolate_video_pca.gif" # default=None
)
Interpolating between day0 and day0.5...
100%|███████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 16.56it/s]
Interpolating between day0.5 and day1...
100%|███████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 11.65it/s]
Interpolating between day1 and day1.5...
100%|███████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 9.41it/s]
Interpolating between day1.5 and day2...
100%|███████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 6.91it/s]
Creating animation...
UMAP ver.
[ ]:
%%time
target_gene = "NKX1-2"
scegot.animate_interpolation(
target_gene,
mode="umap",
interpolate_interval=11,
n_samples=1000,
cmap="gnuplot2",
# save=True, # default=False
# save_path="./interpolate_video_umap.gif" # default=None
)
Cell velocity:
Plotting cell velocity
After the velocity is calculated using calculate_cell_velocities(), visualize it with plot_cell_velocity(). Both PCA and UMAP are supported.
PCA ver.
[30]:
velocities = scegot.calculate_cell_velocities(
mode="pca", # default="pca"
)
scegot.plot_cell_velocity(
velocities,
mode="pca", # default="pca"
cmap="rainbow", # default="rainbow"
# save=True, # default=False
# save_path="./cell_velocity_pca.png" # default=None
)
Calculating cell velocities between each day...
100%|█████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.06it/s]
UMAP ver.
[31]:
%%time
velocities_umap = scegot.calculate_cell_velocities(
mode="umap", # default="pca"
)
scegot.plot_cell_velocity(
velocities_umap,
mode="umap", # default="pca"
cmap="rainbow", # default="rainbow"
# save=True, # default=False
# save_path="./cell_velocity_umap.png" # default=None
)
Calculating cell velocities between each day...
100%|█████████████████████████████████████████████████████████████| 4/4 [03:59<00:00, 59.94s/it]
CPU times: user 4min 14s, sys: 5.42 s, total: 4min 19s
Wall time: 3min 59s
Interpolation of cell velocity
You can color the clusters and / or the streams.
PCA ver.
[32]:
scegot.plot_interpolation_of_cell_velocity(
velocities,
mode="pca", # default="pca"
color_streams=False, # default=False
color_points="gmm", # default="gmm"
cluster_names=list(itertools.chain.from_iterable(cluster_names)), # default=None
cmap="tab20", # default="rainbow"
# save=True, # default=False
# save_path="./cell_velocity_interpolation_pca_point.png" # default=None
)
[33]:
scegot.plot_interpolation_of_cell_velocity(
velocities,
mode="pca", # default="pca"
color_streams=True, # default=False
color_points=None, # default="gmm"
cluster_names=None, # default=None
cmap="rainbow", # default="rainbow"
# save=True, # default=False
# save_path="./cell_velocity_interpolation_pca_stream.png" # default=None
)
UMAP ver.
[34]:
scegot.plot_interpolation_of_cell_velocity(
velocities_umap,
mode="umap", # default="pca"
color_streams=False, # default=False
color_points="gmm", # default="gmm"
cluster_names=list(itertools.chain.from_iterable(cluster_names)), # default=None
cmap="tab20", # default="rainbow"
# save=True, # default=False
# save_path="./cell_velocity_interpolation_umap_point.png" # default=None
)
[35]:
scegot.plot_interpolation_of_cell_velocity(
velocities_umap,
mode="umap", # default="pca"
color_streams=True, # default=False
color_points=None, # default="gmm"
cluster_names=None, # default=None
cmap="rainbow", # default="rainbow"
# save=True, # default=False
# save_path="./cell_velocity_interpolation_umap_stream.png" # default=None
)
Gene regulatory network:
You can visualize the Gene regulatory network between genes specified in selected_genes.
[36]:
%%time
GRNs, ridgeCVs = scegot.calculate_grns(
selected_clusters=None, # default=None
alpha_range=(-2, 2), # default=(2, 2)
cv=3, # default=3
ridge_cv_fit_intercept=False, # default=False
ridge_fit_intercept=False, # default=False
)
Calculating GRNs between each day...
100%|█████████████████████████████████████████████████████████████| 4/4 [00:04<00:00, 1.05s/it]
CPU times: user 28.9 s, sys: 1.61 s, total: 30.5 s
Wall time: 4.2 s
[37]:
selected_genes = ["TFAP2C", "SOX17", "PRDM1", "NKX1-2"]
scegot.plot_grn_graph(
GRNs,
ridgeCVs,
selected_genes,
threshold=0.01, # default=0.01
# save=True, # default=False
# save_paths=[f"./grn_{day_name}.png" for day_name in DAY_NAMES] # default=None
# save_format="png" # default="png"
)
alpha = 8.858667904100823
alpha = 37.92690190732246
alpha = 37.92690190732246
alpha = 100.0
By passing an array of cluster names to selected_clusters, you can specify the clusters to be used in the GRN calculation.
[38]:
%%time
selected_clusters = [[0, 0], [1, 0], [2, 0], [3, 0]]
GRNs_cluster, ridgeCVs_cluster = scegot.calculate_grns(
selected_clusters=selected_clusters, # default=None
alpha_range=(-2, 2), # default=(2, 2)
cv=3, # default=3
ridge_cv_fit_intercept=False, # default=False
ridge_fit_intercept=False, # default=False
)
Calculating GRNs between each day...
100%|█████████████████████████████████████████████████████████████| 4/4 [00:04<00:00, 1.06s/it]
CPU times: user 28.9 s, sys: 1.77 s, total: 30.7 s
Wall time: 4.22 s
[39]:
selected_genes = ["TFAP2C", "SOX17", "PRDM1", "NKX1-2"]
scegot.plot_grn_graph(
GRNs_cluster,
ridgeCVs_cluster,
selected_genes,
threshold=0.01, # default=0.01
# save=True, # default=False
# save_paths=[f"grn_selected_clusters_{day_name}" for day_name in DAY_NAMES] # default=None
# save_format="png" # default="png"
)
alpha = 8.858667904100823
alpha = 14.38449888287663
alpha = 2.06913808111479
alpha = 5.455594781168514
Waddington Landscape:
Both PCA and UMAP are supported.
[43]:
%%time
Wpotential, F_all = scegot.calculate_waddington_potential(
n_neighbors=100, # default=100
knn_other_params={}, # default={}
)
Calculating F between each day...
100%|█████████████████████████████████████████████████████████████| 4/4 [00:51<00:00, 12.80s/it]
Applying knn ...
Computing kernel ...
CPU times: user 5min 40s, sys: 29.7 s, total: 6min 10s
Wall time: 53.5 s
Plots Waddington potential in 3 dimensions. If gene_name is specified, the graph is colored by the expression level of the gene, and if gene_name is None, the graph is colored by waddington potential.
PCA ver.
[44]:
scegot.plot_waddington_potential(
Wpotential,
mode="pca", # default="pca"
gene_name=None, # default=None
# save=True, # default=False
# save_path="./waddington_potential_pca_potential.html" # default=None
)
[45]:
gene_name = "NKX1-2"
scegot.plot_waddington_potential(
Wpotential,
mode="pca", # default="pca"
gene_name=gene_name, # default=None
# save=True, # default=False
# save_path="./waddington_potential_pca_expression.html" # default=None
)
UMAP ver.
[46]:
scegot.plot_waddington_potential(
Wpotential,
mode="umap", # default="pca"
gene_name=None, # defautl=None
# save=True, # default=False
# save_path="./waddington_potential_umap_potential.html" # default=None
)
[47]:
gene_name = "NKX1-2"
scegot.plot_waddington_potential(
Wpotential,
mode="umap",
gene_name=gene_name,
# save=True, # default=False
# save_path="./waddington_potential_umap_expression.html" # default=None
)
You can also display a landscape instead of a plot of points.
PCA ver.
[48]:
scegot.plot_waddington_potential_surface(
Wpotential,
mode="pca",
# save=True, # default=False
# save_path="./waddington_potential_surface_pca.html", # default=None
)
UMAP ver.
[49]:
scegot.plot_waddington_potential_surface(
Wpotential,
mode="umap",
# save=True, # default=False
# save_path="./waddington_potential_surface_umap.html", # default=None
)
[ ]: